Apr/11/2025 Updated by

Docker container for running the code from

「AlphaZero 深層学習・強化学習・探索 人工知能プログラミング実践入門」


Prerequisites

Docker Desktop is installed on a Windows 11 machine equipped with an NVIDIA GPU, following the "Docker on Windows (GPU)" procedure.
The docker command can be run from Ubuntu (WSL2) on the Docker host (= Windows).

1. Prepare the Docker container

1.1. The Python environment to build

Create a Docker container that can run the code from 「AlphaZero 深層学習・強化学習・探索 人工知能プログラミング実践入門」.

[Versions of the main packages used in the book's samples (p.55)]
  python 3.6.7
  tensorflow 1.13.1
  numpy 1.14.6
  matplotlib 3.0.3
  pandas 0.22.0
  Pillow(PIL) 4.1.1
  h5py 2.8.0
  gym 0.10.11
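
The pinned versions above can be compared against whatever a container actually provides. The following is a minimal sketch, not something from the book: the function name `find_mismatches` is an illustrative assumption, and the `installed` dictionary is a hand-written snapshot based on the version checks recorded later on this page.

```python
# Versions the book expects (p.55), as listed above.
EXPECTED = {
    "tensorflow": "1.13.1",
    "numpy": "1.14.6",
    "matplotlib": "3.0.3",
    "pandas": "0.22.0",
    "Pillow": "4.1.1",
    "h5py": "2.8.0",
    "gym": "0.10.11",
}


def find_mismatches(expected, installed):
    """Return {name: (expected, installed_or_None)} for every package that differs."""
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)  # None means "not installed"
        if have != want:
            mismatches[name] = (want, have)
    return mismatches


# Hand-written snapshot of a fresh container (assumption; replace with real checks).
installed = {
    "tensorflow": "1.13.1",
    "numpy": "1.16.3",
    "matplotlib": "3.0.3",
    "h5py": "2.9.0",
}
for name, (want, have) in sorted(find_mismatches(EXPECTED, installed).items()):
    print("{}: expected {}, found {}".format(name, want, have or "not installed"))
```

Packages that match (tensorflow and matplotlib in this snapshot) are simply omitted from the report.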

1.2. Find a Docker image with NVIDIA GPU (CUDA) support

I asked ChatGPT, "Is there a Docker container published by NVIDIA in which tensorflow 1.13 can be used?" and it answered as follows.
tensorflow/tensorflow:1.13.1-gpu and tensorflow/tensorflow:1.13.1-gpu-py3: these images combine TensorFlow 1.13.1 with CUDA 10, so the NVIDIA driver version must support CUDA 10.
Following the links in ChatGPT's answer led to Docker Hub.
```
  tensorflow/tensorflow:1.13.1-gpu-py3-jupyter
```
The image can be pulled with the following command.

docker pull tensorflow/tensorflow:1.13.1-gpu-py3-jupyter

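The documentation states that "nvidia-docker mounts the user-mode components of the NVIDIA driver and the GPUs into the Docker container at launch". It therefore seems that what matters is keeping the NVIDIA driver on the Windows Docker host up to date, and that the versions of CUDA and cuDNN installed on the Windows host itself are irrelevant.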

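1.3. Create the Docker container

Once a suitable Docker image has been found, the following command downloads it, creates a Docker container, and runs it.

(Example) Change the placeholder parts (such as local_dir and container_dir) to suit your environment.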
docker run --gpus all -it --rm -v local_dir:container_dir tensorflow/tensorflow:1.13.1-gpu-py3-jupyter

Create the container with the following settings.

Assigned to	Container name	Port 1 (forwarded to 22)	Port 2 (forwarded to 8888)	Host directory mounted at /root/doc	Docker image
-	alphazero	7075	8085	/home/docker/alphazero	tensorflow/tensorflow:1.13.1-gpu-py3-jupyter

Run the following command in an interactive Ubuntu (WSL2) session on the Windows Docker host.

 docker run --name alphazero --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all \
     -p 7075:22 -p 8085:8888 \
     -v /home/docker/alphazero:/root/doc \
     -it tensorflow/tensorflow:1.13.1-gpu-py3-jupyter
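
When several containers are created with different names and ports, it can help to generate the command from the table values. The sketch below only assembles the argument list shown above; the function name `build_docker_run` and its parameters are illustrative assumptions, and nothing is executed.

```python
import shlex


def build_docker_run(name, ssh_port, jupyter_port, host_dir, image):
    """Assemble the docker run argument list used above from the table values."""
    return [
        "docker", "run",
        "--name", name,
        "--shm-size=1g",
        "--ulimit", "memlock=-1",
        "--ulimit", "stack=67108864",
        "--gpus", "all",
        "-p", "{}:22".format(ssh_port),          # host port -> sshd in the container
        "-p", "{}:8888".format(jupyter_port),    # host port -> Jupyter in the container
        "-v", "{}:/root/doc".format(host_dir),   # persistent directory on the host
        "-it", image,
    ]


args = build_docker_run(
    "alphazero", 7075, 8085, "/home/docker/alphazero",
    "tensorflow/tensorflow:1.13.1-gpu-py3-jupyter",
)
# Print a copy-and-paste friendly command line.
print(" ".join(shlex.quote(a) for a in args))
```

Keeping the values in one place makes it harder to forward a port to the wrong container port by accident.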

1.4. Set up the Docker container (ssh)

Get an interactive session on the Docker guest.

From Ubuntu (WSL2) running on the Windows Docker host, connect to the alphazero container with the docker command.

docker attach alphazero

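For some reason nothing is displayed. Typing Ctrl-C later revealed that the attached process is the Jupyter Notebook HTTP server.

Selecting alphazero in Docker Desktop and opening the "Exec" tab starts a shell, which gives an interactive session.

(Find the token of the running Jupyter Notebook)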
jupyter notebook list

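Whether the GPU is visible was checked as follows. tf.config does not exist in TensorFlow 1.13.1, so device_lib.list_local_devices() is used instead.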


# python
Python 3.5.2 (default, Jan 26 2021, 13:30:48) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print(tf.config.list_physical_devices('GPU'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'tensorflow' has no attribute 'config'
>>> print(tf.config)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'tensorflow' has no attribute 'config'
>>> from tensorflow.python.client import device_lib
>>> print(device_lib.list_local_devices())
2025-04-12 05:15:52.134689: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2025-04-12 05:15:52.314537: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2025-04-12 05:15:52.314798: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x47e2fa0 executing computations on platform CUDA. Devices:
2025-04-12 05:15:52.314846: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): NVIDIA GeForce RTX 3070 Laptop GPU, Compute Capability 8.6
2025-04-12 05:15:52.318474: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2304010000 Hz
2025-04-12 05:15:52.334564: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x493a180 executing computations on platform Host. Devices:
2025-04-12 05:15:52.334623: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2025-04-12 05:15:52.335763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: NVIDIA GeForce RTX 3070 Laptop GPU major: 8 minor: 6 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.95GiB
2025-04-12 05:15:52.335811: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2025-04-12 05:15:52.336088: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2025-04-12 05:15:52.336112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2025-04-12 05:15:52.336118: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2025-04-12 05:15:52.336246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1194] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2025-04-12 05:15:52.336320: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 6762 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 1455752447322421804
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 15660618604836127656
physical_device_desc: "device: XLA_GPU device"
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 14919214141591377901
physical_device_desc: "device: XLA_CPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 7090575770
locality {
  bus_id: 1
  links {
  }
}
incarnation: 18121662368334831392
physical_device_desc: "device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6"
]
>>>
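
The device list above mixes CPU, XLA and GPU entries, so a small filter makes the check easier to read. This is a sketch under assumptions: `gpu_device_names` and `list_gpus` are hypothetical helper names, and `Device` is only a stand-in for the objects returned by `device_lib.list_local_devices()`, modelling the two attributes used here.

```python
from collections import namedtuple

# Stand-in for the objects returned by device_lib.list_local_devices();
# only the two attributes used below are modelled (illustrative assumption).
Device = namedtuple("Device", ["name", "device_type"])


def gpu_device_names(devices):
    """Return the names of devices whose device_type is exactly "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_gpus():
    """Query TensorFlow 1.x for local devices and keep only real GPUs."""
    # Imported here so gpu_device_names can be used without TensorFlow installed.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())


# The four devices reported in the session above.
sample = [
    Device("/device:CPU:0", "CPU"),
    Device("/device:XLA_GPU:0", "XLA_GPU"),
    Device("/device:XLA_CPU:0", "XLA_CPU"),
    Device("/device:GPU:0", "GPU"),
]
print(gpu_device_names(sample))  # ['/device:GPU:0']
```

Inside the container, calling `list_gpus()` would run the same filter on the real device list; an empty result would mean TensorFlow does not see the GPU.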

Set up openssh. Replace YOUR_PASSWORD with a string that is hard to guess.

apt-get update
apt-get upgrade -y --allow-unauthenticated
apt-get install -y openssh-server
echo 'root:YOUR_PASSWORD' | chpasswd
sed -i "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config
service ssh start

The Docker guest can now be reached from outside via ssh by specifying the port number. The IP address is that of the Docker host.

ssh -p 7075 root@133.99.41.195

rsync apparently has to be installed separately.

apt install rsync

The OpenSSL development library will be needed later when compiling cmake, so install it as well.

apt-get install libssl-dev
apt-get install -y openssh    ← This fails: Ubuntu has no package named openssh (the packages are openssh-client and openssh-server).

1.5. Set up the Docker container (tensorflow 1.13.1)

Send the sample code from the AlphaZero book to the container. The rsync command is used, but the transport is ssh, so specify 7075, the ssh port of the Docker guest.

rsync -avr -e "ssh -p 7075" sample root@133.99.41.195:/root/doc

(Problem) Running this prints the following error message and the files cannot be copied.

protocol version mismatch -- is your shell clean?
(see the rsync manpage for an explanation)
rsync error: protocol incompatibility (code 2) at compat.c(622) [sender=3.2.7]

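(Cause) rsync over ssh requires that the remote shell print nothing in a non-interactive session, but /etc/bash.bashrc in this image prints a banner with echo. Wrap the lines that print it so that they run only in interactive shells.

if [[ $- == *i* ]]; then
    # lines that run echo
fi

Changes to /etc/bash.bashrc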

*** bash.bashrc.org	Sat Apr 12 01:15:25 2025
--- bash.bashrc	Sat Apr 12 01:22:43 2025
***************
*** 19,24 ****
--- 19,27 ----
  alias grep="grep --color=auto"
  alias ls="ls --color=auto"
  
+ # added by nitta 2025/04/12
+ if [[ $- == *i* ]]; then
+ 
  echo -e "\e[1;31m"
  cat<<TF
  ________                               _______________                
***************
*** 48,50 ****
--- 51,56 ----
  
  # Turn off colors
  echo -e "\e[m"
+ 
+ fi
+ # till here by nitta

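To confirm that the shell is now clean, run a non-interactive command over ssh; it should print only ok.

$ ssh -p 7075 root@133.99.41.195 echo ok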
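
The same "is your shell clean?" check can be scripted. The following is a rough sketch, assuming key-based or otherwise non-interactive ssh authentication; `is_clean` and `remote_shell_is_clean` are hypothetical helper names, and the second function is defined but not called here because it needs a reachable host.

```python
import subprocess


def is_clean(output, expected="ok\n"):
    """A shell is clean when the captured output is exactly what the command printed."""
    return output == expected


def remote_shell_is_clean(host, port, user="root"):
    """Run `echo ok` over ssh non-interactively and check that nothing else is printed."""
    cmd = ["ssh", "-p", str(port), "{}@{}".format(user, host), "echo", "ok"]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True)
    return result.returncode == 0 and is_clean(result.stdout)


# A banner printed before the command output makes the shell "unclean",
# which is the situation rsync complains about.
print(is_clean("ok\n"))                          # True
print(is_clean("\x1b[1;31m\n banner \nok\n"))    # False
```

Any extra bytes on standard output, even a color escape sequence, are enough to break the rsync protocol handshake.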

The environment needed to run the ch03 sample code is as follows.

requirements.txt

tensorflow_gpu==1.13.1
numpy==1.14.6
matplotlib==3.0.3
pandas==0.22.0
Pillow==4.1.1
h5py==2.8.0
gym==0.10.11
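
pip only accepts exact pins written with `==`; a single `=` is rejected as an invalid requirement. A requirements file can be sanity-checked before use with a small parser like the one below. This is a simplified sketch that handles only plain `name==version` lines; `parse_pins` is a hypothetical helper, not part of pip.

```python
def parse_pins(text):
    """Parse 'name==version' lines into a dict; raise ValueError on anything else."""
    pins = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, sep, version = line.partition("==")
        if not sep or not name or not version or "=" in version:
            raise ValueError("not an exact pin: {!r}".format(line))
        pins[name] = version
    return pins


requirements = """\
tensorflow_gpu==1.13.1
numpy==1.14.6
gym==0.10.11
"""
print(parse_pins(requirements))
```

A line such as `pandas=0.22.0` raises `ValueError` here, which catches the mistake before pip is run.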

Check the versions of the packages that are installed from the start.

root@4206f815ce4d:~# python
Python 3.5.2 (default, Jan 26 2021, 13:30:48)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.

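Python itself cannot be installed or replaced through pip, so the following attempt does not change the interpreter; the container stays on Python 3.5.2.

pip uninstall python
pip install python==3.6.7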

tensorflow 1.13.1 is already installed.

>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.1

numpy is 1.16.3 rather than 1.14.6. Proceed with it as is.

>>> import numpy as np
>>> print(np.__version__)
1.16.3

matplotlib is 3.0.3, the same as in the book.

>>> import matplotlib
>>> print(matplotlib.__version__)
3.0.3

pandas, PIL (Pillow), and gym are not installed. h5py is installed, but as version 2.9.0 rather than 2.8.0.

>>> import pandas as pd
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'pandas'
>>> import PIL
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'PIL'
>>> import Pillow
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'Pillow'
>>> import h5py
>>> print(h5py.__version__)
2.9.0
>>> import gym
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'gym'
>>> from openai import gym
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named 'openai'
>>>

Neither the conda command nor the venv command seems to exist by default.
Install the pandas, Pillow (PIL), h5py, and gym packages without creating a new Python virtual environment.
Install the pandas package.

pip install pandas==0.22.0

Install the pillow package.

pip install pillow==4.1.1

Install the h5py package.

pip install h5py==2.8.0

Install the gym package.

pip install gym==0.10.11

Start Jupyter Notebook.

Jupyter Notebook is already running; it was started automatically. Moreover, connecting via ssh appears to reach the Jupyter Notebook process.

(From another machine)
ssh -p 7075 root@133.99.41.195
 ← no response
Typing Ctrl-C shows:
Serving notebooks from local directory: /tf
0 active kernels
The Jupyter Notebook is running at:
http://(4206f815ce4d or 127.0.0.1):8888/?token=c53633f5e846d47bf88103c6649065357720b3e020ef39cf
Shutdown this notebook server (y/[n])? No answer for 5s: resuming operation...

Access from outside

http://133.99.41.195:8085/?token=a62e....e0ab

3_1_classification.ipynb raises an error.

Check the CUDA version

# cat /usr/local/cuda/version.txt
CUDA Version 10.0.130
# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
# nvidia-smi
Sat Apr 12 05:36:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.72                 Driver Version: 566.14         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3070 ...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   44C    P8             20W /  125W |    7037MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A        55      C   /python3.5                                  N/A      |
+-----------------------------------------------------------------------------------------+

The NVIDIA driver is needed only on the Docker host, but the Docker container must include the CUDA runtime and libraries itself. The role of nvidia-container-toolkit is then to let the container access the host's NVIDIA kernel module.
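
In the output above, nvidia-smi reports CUDA Version 12.7, which is the highest CUDA version the host driver supports, while the container ships CUDA 10.0.130. As a general rule a newer driver can run an older CUDA runtime, and that rule can be written down as a comparison. The helper names below are illustrative assumptions, and the check covers only the driver/runtime version rule, not whether a particular GPU architecture is supported by that runtime.

```python
def parse_version(text):
    """Turn '10.0' or '10.0.130' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.split("."))


def driver_supports_runtime(driver_max_cuda, container_cuda):
    """True if the container's CUDA major.minor is not newer than the driver's maximum."""
    return parse_version(container_cuda)[:2] <= parse_version(driver_max_cuda)[:2]


# Values taken from the nvidia-smi and version.txt output above.
print(driver_supports_runtime("12.7", "10.0.130"))  # True
```

Comparing tuples of integers avoids the string-comparison trap in which "10.0" would sort before "9.2".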

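nvidia-container-toolkit does not appear to be installed. If anything, the Docker container in use seems to have an old NVIDIA driver installed. The container should probably be downloaded from NVIDIA's official site instead. https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow

The container that includes tensorflow-1.13.1 is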
nvcr.io/nvidia/tensorflow:19.03-py3

# python
Python 3.5.2 (default, Jan 26 2021, 13:30:48) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>> print(tf.test.is_gpu_available())
2025-04-12 06:25:39.058560: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2025-04-12 06:25:40.821502: E tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:984] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.
2025-04-12 06:25:40.821722: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x44331b0 executing computations on platform CUDA. Devices:
2025-04-12 06:25:40.821796: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): NVIDIA GeForce RTX 3070 Laptop GPU, Compute Capability 8.6
2025-04-12 06:25:40.827096: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2304010000 Hz
2025-04-12 06:25:40.837139: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x458a370 executing computations on platform Host. Devices:
2025-04-12 06:25:40.837194: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2025-04-12 06:25:40.838116: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties: 
name: NVIDIA GeForce RTX 3070 Laptop GPU major: 8 minor: 6 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 8.00GiB freeMemory: 6.95GiB
2025-04-12 06:25:40.838163: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2025-04-12 06:25:40.838401: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2025-04-12 06:25:40.838425: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0 
2025-04-12 06:25:40.838445: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N 
2025-04-12 06:25:40.838552: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1194] Could not identify NUMA node of platform GPU id 0, defaulting to 0.  Your kernel may not have been built with NUMA support.
2025-04-12 06:25:40.838627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 6762 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3070 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6)
True

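Edited up to this point as of Feb/24/2025

Not yet edited below.

2. Use the Docker container

In the following, the IP address of the Docker host is assumed to be 133.99.41.195.

2.1. Access the Docker container via ssh from another machine

Access the Docker host (133.99.41.195) via ssh, specifying the port number (7077 here). Individual Docker containers can be reached by changing the port number.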

ssh -p 7077 root@133.99.41.195

After login, /root is the current directory.
The folder /root/doc is mapped to /home/docker/4semi7 on the Docker host and is persistent. In other words, everything outside this folder is lost when the container is deleted.
When a container shell is started from Docker Desktop on the Docker host, the current directory is /workspace. Various files may be placed there.
[Note] If ssh exits with an error saying that the host key differs from the one in ~/.ssh/known_hosts, remove the entry from ~/.ssh/known_hosts with the following command.

ssh-keygen -R '[133.99.41.195]:7077'
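
The bracketed form in the command above is how ssh records hosts reached on a non-default port; hosts on port 22 are stored by name alone. A small sketch of that rule follows, with `known_hosts_key` and `removal_command` as hypothetical helper names. The command is only built, not executed.

```python
def known_hosts_key(host, port=22):
    """Return the host pattern that ssh stores in known_hosts for host and port."""
    if port == 22:
        return host  # default port: plain host name or address
    return "[{}]:{}".format(host, port)  # non-default port: bracketed form


def removal_command(host, port=22):
    """Build the ssh-keygen command that removes the matching known_hosts entry."""
    return ["ssh-keygen", "-R", known_hosts_key(host, port)]


print(removal_command("133.99.41.195", 7077))
# ['ssh-keygen', '-R', '[133.99.41.195]:7077']
```

Because each forwarded port has its own entry, removing the key for one container leaves the entries for the others untouched.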

2.2. Start jupyter on the Docker container from another machine

Accessing the Docker host (133.99.41.195) via ssh with the port number (7077 here) forwards the connection to ssh in the Docker container.

ssh -p 7077 root@133.99.41.195

Start jupyter notebook on the Docker container. Note the URL shown at startup, especially the token. The token value is different every time.

jupyter notebook

http://localhost:8888/?token=...(omitted)
or
http://127.0.0.1:8888/?token=...(omitted)

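2.3. Access with a browser

Start a browser on your own PC. Google Chrome is recommended.
The URL that jupyter notebook shows at startup is for access from the local host. To access it from another machine, change the IP address and port number in the URL.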

(Example)
URL shown by jupyter:
  http://127.0.0.1:8888/?token=64b6f850fc2a1b9fb52c71b7cd3240d59870619b32a4b4e7

(Changes)
IP address: 127.0.0.1 (or localhost) → 133.99.41.195
Port number: 8888 → 8087

(URL to access)
  http://133.99.41.195:8087/?token=64b6f850fc2a1b9fb52c71b7cd3240d59870619b32a4b4e7
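
The rewrite above is mechanical, so it can be done with the standard library instead of by hand. This is a minimal sketch; `external_url` is a hypothetical helper name, and the host and port are the example values from this page.

```python
from urllib.parse import urlsplit, urlunsplit


def external_url(jupyter_url, host, port):
    """Replace the host and port of the URL jupyter printed, keeping path and token."""
    parts = urlsplit(jupyter_url)
    return urlunsplit(parts._replace(netloc="{}:{}".format(host, port)))


local = "http://127.0.0.1:8888/?token=64b6f850fc2a1b9fb52c71b7cd3240d59870619b32a4b4e7"
print(external_url(local, "133.99.41.195", 8087))
# http://133.99.41.195:8087/?token=64b6f850fc2a1b9fb52c71b7cd3240d59870619b32a4b4e7
```

Only the network location changes; the token in the query string is carried over untouched.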

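[Note] If port 8888 is unavailable, jupyter notebook uses a different port number (such as 8889). Because only container port 8888 is forwarded from the Docker host, the notebook can then no longer be reached from outside.

When jupyter notebook terminates abnormally, port 8888 may stay occupied for a while (about 5 minutes). If jupyter notebook is restarted in that state, it starts on a port other than 8888 and cannot be reached from outside. Wait a while (about 5 minutes) before starting jupyter notebook again.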
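
Rather than waiting a fixed time, the port can be probed before jupyter notebook is restarted. The sketch below simply tries to bind a TCP socket; `port_is_free` is a hypothetical helper, and the probe is only an approximation of what jupyter does when it picks a port.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            return False  # something still holds the port
    return True


if port_is_free(8888):
    print("port 8888 is free; jupyter notebook can be started")
else:
    print("port 8888 is still in use; wait before starting jupyter notebook")
```

Run inside the container, this reports whether a restart would end up on the forwarded port.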
