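When installing TensorFlow for Julia, especially the GPU-enabled build, the CUDA version installed on the server may not match the CUDA version that the downloaded TensorFlow binary was built against.
In that case you have to build TensorFlow yourself and then do some follow-up work.
First, install the Julia TensorFlow package and build it with GPU support.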
julia> ENV["TF_USE_GPU"] = "1"
pkg> add TensorFlow#master
pkg> build TensorFlow
The built TensorFlow libraries end up in, for example, /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin.
If the build succeeds, the following two files are created:
libtensorflow.so
libtensorflow_framework.so
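A quick shell check can confirm that both libraries are really there before moving on. This is only a sketch: the function name have_tf_libs is made up for illustration, and the directory to pass is whatever your own package path turns out to be.

```shell
# Hypothetical helper: succeed only if both TensorFlow libraries exist in DIR.
have_tf_libs() {
    dir=$1
    for lib in libtensorflow.so libtensorflow_framework.so; do
        # -e follows symlinks, so a dangling link counts as missing.
        if [ ! -e "$dir/$lib" ]; then
            echo "missing: $dir/$lib" >&2
            return 1
        fi
    done
    echo "ok: both libraries found in $dir"
}

# Usage (example path from this post):
#   have_tf_libs /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin
```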
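If running the code below in Julia raises an error, or it looks for libraries from a CUDA version other than the one currently installed, delete the .so libraries above
and do a custom build.
julia> using TensorFlow
julia> TensorFlow.Session()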
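The mismatch can sometimes be seen without starting Julia by looking at what the library links against. The sketch below filters ldd output for CUDA-related libraries; filter_cuda_libs is a hypothetical helper name, and the list of library names in the pattern is my own choice. Some TensorFlow builds load the CUDA libraries at run time instead of linking them, in which case ldd may show nothing and the Julia log remains the only evidence.

```shell
# Hypothetical filter: keep only the CUDA-related lines of `ldd` output read from stdin.
filter_cuda_libs() {
    # `|| true` keeps the exit status clean when nothing matches.
    grep -E 'lib(cuda|cudart|cublas|cudnn|cufft|curand|cusolver|cusparse|nccl)\.so' || true
}

# Usage (example path from this post):
#   ldd /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin/libtensorflow.so | filter_cuda_libs
# A line ending in "not found", or a version suffix that differs from the
# installed CUDA, points to a mismatch.
```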
TensorFlow 1.10 to 1.12 is built with Bazel 0.18.
From 1.13 onward, use Bazel 0.26.1.
If a newer Bazel is already installed, remove it as follows.
bazel shutdown
rm -rf "$HOME/.cache/bazel"
sudo rm -rf /usr/local/bin/bazel /etc/bazelrc /usr/local/lib/bazel
Download Bazel 0.26.1 (the build fails with newer versions)
wget https://releases.bazel.build/0.26.1/release/bazel-0.26.1-installer-linux-x86_64.sh
chmod +x bazel-0.26.1-installer-linux-x86_64.sh
./bazel-0.26.1-installer-linux-x86_64.sh --user
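With --user the installer puts the bazel launcher under $HOME/bin, so that directory has to be on PATH. The sketch below adds it and pulls the version out of the output of bazel version so you can confirm that 0.26.1 is the one being picked up. bazel_label is a hypothetical helper name.

```shell
# The --user installer places the bazel launcher in $HOME/bin.
export PATH="$HOME/bin:$PATH"

# Hypothetical helper: extract the version from `bazel version` output on stdin.
bazel_label() {
    sed -n 's/^Build label: *//p'
}

# Usage:
#   bazel version | bazel_label
```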
Download TensorFlow 1.13.1
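The download link itself did not survive in this post. One option, assuming GitHub's usual tag-archive URL pattern for the tensorflow/tensorflow repository, is sketched below. tf_archive_url is a hypothetical helper, and the URL pattern is my assumption, not something taken from the original link.

```shell
# Hypothetical helper: print the GitHub source-archive URL for a TensorFlow release tag.
tf_archive_url() {
    printf 'https://github.com/tensorflow/tensorflow/archive/v%s.tar.gz\n' "$1"
}

# Usage:
#   wget -O tensorflow.tar.gz "$(tf_archive_url 1.13.1)"
#   tar xzf tensorflow.tar.gz
```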
The build also needs NCCL.
Get the source from the NCCL download page and build it.
After building, copy the NCCL libraries into place as follows:
cd build
sudo mkdir -p /usr/local/cuda/nccl/lib /usr/local/cuda/nccl/include
sudo cp *.txt /usr/local/cuda/nccl
sudo cp include/*.h /usr/include/
sudo cp lib/libnccl.so.2.5.6 lib/libnccl_static.a /usr/lib/x86_64-linux-gnu/
sudo ln -s /usr/include/nccl.h /usr/local/cuda/nccl/include/nccl.h
cd /usr/lib/x86_64-linux-gnu
sudo ln -s libnccl.so.2.5.6 libnccl.so.2
sudo ln -s libnccl.so.2 libnccl.so
for i in libnccl*; do sudo ln -s /usr/lib/x86_64-linux-gnu/$i /usr/local/cuda/nccl/lib/$i; done
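Since the steps above create several layers of symlinks, it is worth checking that the chain actually ends at a real file. A minimal sketch, assuming GNU readlink is available. nccl_target is a hypothetical helper name.

```shell
# Hypothetical helper: print the file that DIR/libnccl.so finally resolves to,
# failing if the symlink chain is broken.
nccl_target() {
    target=$(readlink -f "$1/libnccl.so") || return 1
    # readlink -f can succeed even when the last path component is missing,
    # so check explicitly that the resolved target is a regular file.
    [ -f "$target" ] || return 1
    basename "$target"
}

# Usage:
#   nccl_target /usr/lib/x86_64-linux-gnu
#   nccl_target /usr/local/cuda/nccl/lib
```

With the links from this post, both calls should print the name of the versioned library that was copied.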
Unpack the TensorFlow archive you downloaded and change into the tensorflow directory.
./configure
Do you wish to build TensorFlow with CUDA support? [y/N]:y
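To my knowledge the configure script also takes its answers from environment variables, which saves retyping them on every rebuild. The sketch below sets a few of them using the values visible in the log output of this post (CUDA 10.1, cuDNN 7, compute capability 6.1). Treat the variable names as an assumption to verify against the configure.py of the version you unpacked. Questions without a matching variable are still asked interactively.

```shell
# Answers for ./configure, taken from the log output in this post
# (CUDA 10.1, cuDNN 7, TITAN Xp with compute capability 6.1).
# Adjust every value to your own machine.
export TF_NEED_CUDA=1
export TF_CUDA_VERSION=10.1
export TF_CUDNN_VERSION=7
export TF_CUDA_COMPUTE_CAPABILITIES=6.1

# Run configure only when we are actually inside the TensorFlow source tree.
if [ -x ./configure ]; then
    ./configure
fi
```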
Build TensorFlow
bazel build --config=opt --config=cuda --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0" //tensorflow:libtensorflow.so
After the build finishes, copy the results to the directory where the TensorFlow package keeps its libraries (for example /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin):
bazel shutdown
cd bazel-bin/tensorflow
cp -a * /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin
cd /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin
ln -s libtensorflow_framework.so.2 libtensorflow_framework.so
ls -al
lrwxrwxrwx 1 shpark shpark 28 Dec 4 02:33 libtensorflow_framework.so -> libtensorflow_framework.so.2*
lrwxrwxrwx 1 shpark shpark 32 Dec 4 02:26 libtensorflow_framework.so.2 -> libtensorflow_framework.so.2.0.0*
-r-xr-xr-x 1 shpark shpark 34342328 Dec 4 02:26 libtensorflow_framework.so.2.0.0*
-r-xr-xr-x 1 shpark shpark 22121 Dec 4 01:54 libtensorflow_framework.so.2.0.0-2.params*
lrwxrwxrwx 1 shpark shpark 18 Dec 4 02:27 libtensorflow.so -> libtensorflow.so.2*
lrwxrwxrwx 1 shpark shpark 22 Dec 4 02:27 libtensorflow.so.2 -> libtensorflow.so.2.0.0*
-r-xr-xr-x 1 shpark shpark 539673528 Dec 4 02:27 libtensorflow.so.2.0.0*
-r-xr-xr-x 1 shpark shpark 143213 Dec 4 01:55 libtensorflow.so.2.0.0-2.params*
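The copy and link steps can be wrapped in one function so they are easy to repeat after a rebuild. This is a sketch, not part of the TensorFlow package: install_tf_libs is a hypothetical name, and the .so.2 suffix is hard-coded to match the listing above, so a build with a different major version would need a different suffix.

```shell
# Hypothetical helper: copy the built libraries from SRC into DEST and
# make sure the unversioned .so names exist there.
install_tf_libs() {
    src=$1
    dest=$2
    mkdir -p "$dest"
    # -a keeps symlinks as symlinks instead of copying the big files twice.
    cp -a "$src"/libtensorflow*.so* "$dest"/
    for name in libtensorflow libtensorflow_framework; do
        # Create NAME.so -> NAME.so.2 only when the build did not provide NAME.so.
        if [ ! -e "$dest/$name.so" ] && [ -e "$dest/$name.so.2" ]; then
            ln -sf "$name.so.2" "$dest/$name.so"
        fi
    done
}

# Usage (run from the TensorFlow source tree, example path from this post):
#   install_tf_libs bazel-bin/tensorflow /home/shpark/.julia/packages/TensorFlow/JljDB/deps/usr/bin
```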
Now run the following in Julia:
julia> using TensorFlow
julia> TensorFlow.Session()
If GPU information is printed as below, the setup succeeded.
2019-12-04 02:38:29.094119: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2598210000 Hz
2019-12-04 02:38:29.096951: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3569360 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2019-12-04 02:38:29.096994: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2019-12-04 02:38:29.100743: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2019-12-04 02:38:29.648171: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 0 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:05:00.0
2019-12-04 02:38:29.649442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 1 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:06:00.0
2019-12-04 02:38:29.650699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 2 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:09:00.0
2019-12-04 02:38:29.651941: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1551] Found device 3 with properties:
name: TITAN Xp major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:0a:00.0
2019-12-04 02:38:29.652168: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2019-12-04 02:38:29.653514: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2019-12-04 02:38:29.654750: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2019-12-04 02:38:29.654994: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2019-12-04 02:38:29.656261: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2019-12-04 02:38:29.656903: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2019-12-04 02:38:29.659538: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2019-12-04 02:38:29.671761: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1679] Adding visible gpu devices: 0, 1, 2, 3
2019-12-04 02:38:29.671796: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2019-12-04 02:38:29.677085: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1092] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-12-04 02:38:29.677102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1098] 0 1 2 3
2019-12-04 02:38:29.677111: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1111] 0: N Y Y Y
2019-12-04 02:38:29.677117: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1111] 1: Y N Y Y
2019-12-04 02:38:29.677124: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1111] 2: Y Y N Y
2019-12-04 02:38:29.677130: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1111] 3: Y Y Y N
2019-12-04 02:38:29.687338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11436 MB memory) -> physical GPU (device: 0, name: TITAN Xp, pci bus id: 0000:05:00.0, compute capability: 6.1)
2019-12-04 02:38:29.690090: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11439 MB memory) -> physical GPU (device: 1, name: TITAN Xp, pci bus id: 0000:06:00.0, compute capability: 6.1)
2019-12-04 02:38:29.692745: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 11439 MB memory) -> physical GPU (device: 2, name: TITAN Xp, pci bus id: 0000:09:00.0, compute capability: 6.1)
2019-12-04 02:38:29.695417: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1237] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 11439 MB memory) -> physical GPU (device: 3, name: TITAN Xp, pci bus id: 0000:0a:00.0, compute capability: 6.1)
Session(Ptr{Nothing} @0x00007ff5a081c070)
You can also check as follows. After running the code above, the julia process shows up as holding GPU memory.
$ nvidia-smi
Wed Dec 4 02:57:46 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            On   | 00000000:05:00.0 Off |                  N/A |
|  0%   21C    P8    15W / 250W |    155MiB / 12193MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            On   | 00000000:06:00.0 Off |                  N/A |
|  0%   21C    P8     8W / 250W |    155MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  TITAN Xp            On   | 00000000:09:00.0 Off |                  N/A |
|  0%   22C    P8     8W / 250W |    155MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  TITAN Xp            On   | 00000000:0A:00.0 Off |                  N/A |
|  0%   18C    P8     7W / 250W |    155MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     17252      C   julia                                        145MiB |
|    1     17252      C   julia                                        145MiB |
|    2     17252      C   julia                                        145MiB |
|    3     17252      C   julia                                        145MiB |
+-----------------------------------------------------------------------------+
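For scripting, the same check can be done on nvidia-smi's CSV output instead of reading the table by eye. The sketch below filters the per-process list for julia. julia_gpu_procs is a hypothetical helper name, and it assumes the comma-separated format produced by the query shown in the usage comment.

```shell
# Hypothetical filter: from "pid, process_name, used_memory" CSV lines on stdin,
# print the PID and memory of every process whose name is (or ends in /) julia.
julia_gpu_procs() {
    awk -F', *' '$2 ~ /(^|\/)julia$/ { print $1, $3 }'
}

# Usage:
#   nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader | julia_gpu_procs
```

An empty result after TensorFlow.Session() has been created would suggest the session is not using the GPU.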