Setting up Tensorflow 1.X on Ubuntu 16.04 w/ GPU support


--D. Thiebaut (talk) 14:50, 31 August 2017 (EDT)


This page is a quick log of the various steps I took to set up TensorFlow 1.3 (and 1.4) with Python 3.5 on an AMD 64-bit machine with an NVIDIA GPU card (GeForce GTX 960). Some of the steps are more detailed than others. The goal was to allow me to quickly redo the installation in case of trouble (which happened a few times). The second part lists the steps I took to compile TensorFlow in a virtualenv setup, still on the same machine. The version compiled from source ran about 23% faster than the prebuilt package, so compiling from source may improve your performance overall.


System Specs


Computer


-Computer-
Processor		: 8x AMD FX(tm)-8320 Eight-Core Processor
Memory		: 16165MB (2144MB used)
Operating System		: Ubuntu 16.04.3 LTS
User Name		: 
Date/Time		: Thu 31 Aug 2017 02:52:46 PM EDT
-Display-
Resolution		: 2048x1130 pixels
OpenGL Renderer		: Unknown
X11 Vendor		: The X.Org Foundation
-Multimedia-
Audio Adapter		: HDA-Intel - HDA ATI SB
Audio Adapter		: HDA-Intel - HDA NVidia
-Input Devices-
 Power Button
 Power Button
 CHICONY USB Keyboard
 CHICONY USB Keyboard
 Logitech Optical USB Mouse
 HDA ATI SB Front Mic
 HDA ATI SB Rear Mic
 HDA ATI SB Line
 HDA ATI SB Line Out Front
 HDA ATI SB Line Out Surround
 HDA ATI SB Line Out CLFE
 HDA ATI SB Line Out Side
 HDA ATI SB Front Headphone
 Eee PC WMI hotkeys
 HDA NVidia HDMI/DP,pcm		: 3=
 HDA NVidia HDMI/DP,pcm		: 7=
 HDA NVidia HDMI/DP,pcm		: 8=
 HDA NVidia HDMI/DP,pcm		: 9=
-Printers-
No printers found
-SCSI Disks-
ATA WDC WDS250G1B0A-
ATA WDC WD10EAVS-00D
TSSTcorp CDDVDW SH-224DB


GPU Specs


:~$ nvidia-smi

Fri Sep  1 11:09:08 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82                 Driver Version: 375.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 960     Off  | 0000:06:00.0      On |                  N/A |
|  0%   23C    P8     6W / 160W |    633MiB /  1996MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1082    G   /usr/lib/xorg/Xorg                             330MiB |
|    0      1893    G   compiz                                         230MiB |
|    0      3092    G   /proc/self/exe                                  68MiB |
|    0     16624    G   unity-control-center                             1MiB |
+-----------------------------------------------------------------------------+


Option 1: Install the TensorFlow Default Build, Version 1.3


# install 16.04.3 desktop on 250GB SSD
# use 16.04.3 DVD

# ...
# Once booted into SSD:

sudo apt-get install  emacs
sudo apt-get install openssh-server
sudo apt-get install ddclient 

# install nvidia drivers
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt update
sudo apt-get install nvidia-375

# get infinite loop on rebooting.
# reboot and hold shift
# pick first ubuntu linux option
# add "nomodeset" before "quiet splash"
# press F10
# reboot again ==> ok

sudo update-manager
--> settings
   --> Additional Drivers
       check that we are using NVIDIA 375.82

# go to NVIDIA tensorflow install page
# go to https://developer.nvidia.com/cuda-downloads
# download cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb

sudo dpkg -i cuda-repo-ubuntu1604-8-0-local-ga2_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

# had to do "sudo apt-get -f install " first to resolve all dependencies

# download or copy cudnn-8.0-linux-x64-v6.0.tgz to ~/Downloads
cd ~/Downloads
tar -xzvf cudnn-8.0-linux-x64-v6.0.tgz
sudo cp cuda/include/cudnn.h /usr/local/cuda/include
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*

# install pip

sudo apt-get install python-setuptools python-dev build-essential
sudo easy_install pip
sudo pip install --upgrade virtualenv

# make dir for virtualenv

mkdir ~/tensorflow_1

# start installing tensorflow

sudo apt-get install libcupti-dev

sudo apt-get install python3-pip python3-dev python-virtualenv
virtualenv --system-site-packages -p python3 ~/tensorflow_1

# try activate script

source ~/tensorflow_1/bin/activate
easy_install -U pip

pip install --upgrade tensorflow-gpu

# add exports to .bashrc

export PATH="/usr/local/cuda/bin:$HOME/bin:$PATH"
export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64"
export CUDA_HOME=/usr/local/cuda
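
# Before testing, a few sanity checks (my own addition, not part of the
# original log) can confirm that the CUDA toolkit, the cuDNN files, and the
# environment variables are in place, assuming the default /usr/local/cuda
# paths used above:

nvcc --version
grep -A 2 "define CUDNN_MAJOR" /usr/local/cuda/include/cudnn.h
ls -l /usr/local/cuda/lib64/libcudnn*
echo $LD_LIBRARY_PATH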


Test Installation


tensor@hadoop0:~$ alias | grep activate

alias activateTensorflow='source ~/tensorflow_1/bin/activate'

tensor@hadoop0:~$ activateTensorflow 
(tensorflow_1) :~$ cd
(tensorflow_1) :~$ cd REG
(tensorflow_1) :~/REG$ cat testTensorflow.py
import tensorflow as tf

hello = tf.constant( 'Hello, Tensorflow!' )
sess = tf.Session()
print( sess.run( hello ) )
print( "running on Tensorflow Version", tf.__version__ )
(tensorflow_1) :~/REG$ python3 testTensorflow.py 

2017-09-02 10:53:41.359957: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 10:53:41.359990: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 10:53:41.359998: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 10:53:41.360005: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-02 10:53:41.601442: I tensorflow/core/common_runtime/gpu/gpu_device.cc:955] Found device 0 with properties: 
name: GeForce GTX 960
major: 5 minor: 2 memoryClockRate (GHz) 1.342
pciBusID 0000:06:00.0
Total memory: 1.95GiB
Free memory: 1.29GiB
2017-09-02 10:53:41.601481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:976] DMA: 0 
2017-09-02 10:53:41.601489: I tensorflow/core/common_runtime/gpu/gpu_device.cc:986] 0:   Y 
2017-09-02 10:53:41.601507: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:06:00.0)
b'Hello, Tensorflow!'
running on Tensorflow Version 1.3.0

(tensorflow_1) :~/REG$ deactivate 

:~/REG$
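
Beyond the hello-world test above, one can also verify that TensorFlow actually sees the GPU by listing the local devices. The short script below is my own addition (the file name listDevices.py is arbitrary); it uses the standard TensorFlow 1.x device_lib API:

# listDevices.py -- confirm that TensorFlow 1.x sees the GPU
from tensorflow.python.client import device_lib

for dev in device_lib.list_local_devices():
    # prints the CPU device and the GPU device (the GTX 960, if detected)
    print( dev.name, dev.device_type, dev.physical_device_desc )

Run it inside the activated virtualenv with python3 listDevices.py.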



Option 2: Compile TensorFlow 1.4 from Source


  • I kept getting warnings, when running TensorFlow code, that the build had been compiled without support for various CPU instructions available on the AMD chip, so I decided to compile TensorFlow from source. It took several trials. The following steps document the compilation of TensorFlow 1.4 in virtualenv, under Ubuntu 16.04, on the same machine as in the previous section.


Steps


  • Remove old installation of tensorflow
  • Reinstall virtualenv
  • Follow the directions on tensorflow.org to install virtualenv, and then compile TensorFlow from source.
  • activate virtualenv
  • download from git and follow directions
  • make sure to pick 5.0 (for GTX960 Nvidia GPU) as hardware compatibility when building
  • Here's a cleaned up history of commands:
  activateTensorflow  #(an alias that sources the activate script in ~/tensorflow_1, a dir I created for virtualenv) 

  # clone tensorflow
  git clone https://github.com/tensorflow/tensorflow 
  cd tensorflow
  git checkout

  # install python3 dependencies
  sudo apt-get install python3-numpy python3-dev python3-pip python3-wheel
  sudo apt-get install libcupti-dev

  # configure: accept all defaults except CUDA, and pick 5.0 as the compatible
  # hardware (compute capability), as required for the GTX 960 GPU
  bazel clean
  ./configure 
  bazel build --cxxopt="-D_GLIBCXX_USE_CXX11_ABI=0"  --config=opt --config=cuda \
                         //tensorflow/tools/pip_package:build_pip_package

  bazel-bin/tensorflow/tools/pip_package/build_pip_package /tmp/tensorflow_pkg 

  # install
  sudo -H pip3 install /tmp/tensorflow_pkg/tensorflow-1.4.0.dev0-cp35-cp35m-linux_x86_64.whl
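
  # quick check (my own addition): confirm the freshly built wheel is the one
  # in use, and that the SSE/AVX/FMA "wasn't compiled to use" warnings from
  # the prebuilt package no longer appear
  python3 -c "import tensorflow as tf; print( tf.__version__ ); tf.Session()" 2>&1 | grep -i "compiled to use"
  # (no grep output means the CPU-feature warnings are gone)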


Testing


Test Program


# ~/REG/testTensorflow.py

import tensorflow as tf
import sys

hello = tf.constant( 'Hello, Tensorflow!' )
sess = tf.Session()
print( sess.run( hello ) )
print( "running on Tensorflow Version", tf.__version__ )
print( "running on Python Version", sys.version )


Output


deactivate

activateTensorflow 

cd ~/REG

python testTensorflow.py 

2017-09-14 07:14:26.059736: I tensorflow/core/common_runtime/gpu/gpu_device.cc:965] Found device 0 with properties: 
name: GeForce GTX 960 major: 5 minor: 2 memoryClockRate(GHz): 1.342
pciBusID: 0000:06:00.0
totalMemory: 1.94GiB freeMemory: 1.51GiB
2017-09-14 07:14:26.059777: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1055] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960, pci bus id: 0000:06:00.0, compute capability: 5.2)
b'Hello, Tensorflow!'
running on Tensorflow Version 1.4.0-dev
running on Python Version 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609]


Performance Comparison


I ran the training of an RNN for 100 iterations on various machines. This typically takes about 2 h 30 min on an AMD FX-8320 8-core processor with 32GB of RAM (Hadoop0).

The hardware tested was a MacBook Pro (the slowest), the AMD machine with the GPU turned on and off (comparing the TensorFlow build downloaded directly against the one compiled from source), several AWS instances (some with a GPU, some without), and a Paperspace machine. The results are summarized below.
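
The "sec/iteration" figures in the logs further down are running averages: the total elapsed time divided by the number of iterations completed so far. The actual training script is not reproduced here; a minimal, hypothetical sketch of that kind of timing loop would be:

import time

start = time.time()
NUM_ITERATIONS = 100

for i in range( 1, NUM_ITERATIONS+1 ):
    # ... run one training iteration here ...
    elapsed = time.time() - start
    h, m, s = int( elapsed//3600 ), int( (elapsed%3600)//60 ), int( elapsed%60 )
    print( "%d:%02d:%02d %.5f sec/iteration" % ( h, m, s, elapsed/i ) )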

Machine                                       Execution Time      Price
--------------------------------------------  ------------------  -----------------
Macbook                                       135 sec/iteration   --
AMD w/ compiled Tensorflow                    2 h 04 min          --
AMD w/ compiled Tensorflow, GPU turned OFF    1 h 53 min          --
AMD w/ compiled Tensorflow, GPU turned ON     2 h 02 min          --
AWS c3.2xlarge                                1 h 35 min          $0.21/hr
AWS c3.4xlarge                                1 h 15 min          $0.84/hr
AWS p2.xlarge                                 1 h 27 min          $0.90/hr
AWS p2.xlarge with boosted clock              1 h 22 min          $0.90/hr
Paperspace                                    2 h 06 min          $0.40/hr + $5/mth


Macbook Pro

  • MacbookPro: takes twice as long per iteration as Hadoop0
16:00:38 134.04232 sec/iteration mse = 4.27171
16:19:26 135.09490 sec/iteration mse = 3.49061
16:28:38 134.81496 sec/iteration mse = 3.90631
16:47:45 135.87782 sec/iteration mse = 4.01627
16:56:33 135.54179 sec/iteration mse = 3.87095
(didn't wait for it to finish...)

AMD FX(tm)-8320 Eight-Core Processor

  • Hadoop0
  • 100 iterations with default built Tensorflow+GPU 1.3: 2 h 42 min 19 sec
  • 100 iterations with compiled Tensorflow+GPU 1.4: 2 h 04 min 07 sec
  • The compiled version of the program is 23% faster.
  • Hadoop0: 100 iterations with compiled Tensorflow+GPU 1.4, with GPU turned OFF:
1:31:59 68.98837 sec/iteration mse = 3.04447
1:37:32 68.85313 sec/iteration mse = 2.72302
1:43:05 68.72948 sec/iteration mse = 3.70767
1:48:38 68.61912 sec/iteration mse = 3.05828 
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 1:53:05
1:53:05 68.54140 sec/iteration mse = 2.51685
 
  • Hadoop0: 100 iterations with compiled Tensorflow+GPU 1.4, with GPU turned ON:
1:51:19 74.21151 sec/iteration mse = 3.97014
1:56:06 74.10998 sec/iteration mse = 2.03474
1:57:17 74.08389 sec/iteration mse = 3.24752
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 2:02:05
2:02:05 73.99128 sec/iteration mse = 2.03474
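
The log above doesn't show how the GPU was "turned OFF" for this comparison; a standard way to do it without touching the build is to hide the GPU from CUDA with the CUDA_VISIBLE_DEVICES environment variable (shown here as an assumption, not necessarily the method actually used):

# GPU hidden from TensorFlow: falls back to CPU only
CUDA_VISIBLE_DEVICES="" python3 testTensorflow.py

# GPU visible (normal run)
python3 testTensorflow.py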

Amazon AWS

  • Amazon AWS c3.2xlarge EC2 instance (no GPU), High Frequency Intel Xeon E5-2680 v2, 8 vCPUs, 15 GB RAM.
0:17:03 68.25731 sec/iteration mse = 4.10718
0:21:57 65.85389 sec/iteration mse = 3.37642
0:31:42 63.42248 sec/iteration mse = 3.17896
1:19:09 57.92274 sec/iteration mse = 2.24745
1:21:58 57.86898 sec/iteration mse = 3.82073
1:26:41 57.79624 sec/iteration mse = 3.49653
1:31:23 57.72434 sec/iteration mse = 3.85911
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 1:35:09
1:35:09 57.67420 sec/iteration mse = 2.24745
  • Amazon AWS c3.4xlarge EC2 Instance.
1:01:06 45.82601 sec/iteration mse = 2.00915
1:04:51 45.78664 sec/iteration mse = 3.41083
1:08:37 45.75064 sec/iteration mse = 3.76611
1:12:23 45.72430 sec/iteration mse = 2.90184
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 1:15:25
1:15:25 45.71390 sec/iteration mse = 2.00915

  • Amazon AWS p2.xlarge EC2 instance (with Tesla K80 GPU), High Frequency Intel Xeon E5-2686v4 (Broadwell) Processors, 4 vCPUs, 61 GB RAM.
with AMI-72717c09 (Ubuntu, GPU accelerated Tensorflow)
1:11:48 53.86068 sec/iteration mse = 3.63128
1:16:04 53.69830 sec/iteration mse = 2.99531
1:20:18 53.53553 sec/iteration mse = 3.69307
1:24:30 53.37746 sec/iteration mse = 3.34477
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 1:27:53
1:27:53 53.26431 sec/iteration mse = 2.43119

  • Amazon AWS p2.xlarge EC2 instance (with Tesla K80 GPU boosted to max speed)
with AMI-72717c09 (Ubuntu, GPU accelerated Tensorflow)
After boosting GPU Clock (see AWS page on this)
 sudo nvidia-smi -pm 1
 sudo nvidia-smi --auto-boost-default=0
 sudo nvidia-smi -ac 2505,875

Output
1:08:00 50.37704 sec/iteration mse = 2.08107
1:11:12 50.26984 sec/iteration mse = 4.33246
1:15:12 50.14071 sec/iteration mse = 3.46657
1:19:11 50.01312 sec/iteration mse = 4.18742 
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 1:22:22
1:22:22 49.92854 sec/iteration mse = 2.08107
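
To confirm that the boosted clocks were actually applied, the current and applications clocks can be queried (a standard nvidia-smi query; this check is my addition, not part of the original log):

 nvidia-smi -q -d CLOCK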


PaperSpace


Same experiment on PaperSpace.com. The machine ($0.40/hr) has the following specs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2600.058
BogoMIPS:              5200.11
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7


The GPU has the following specs:

Sat Oct  7 21:28:13 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        Off  | 0000:00:05.0      On |                  N/A |
| 48%   46C    P0    46W / 120W |   7792MiB /  8121MiB |     46%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2504    G   /usr/lib/xorg/Xorg                             178MiB |
|    0      2929    G   compiz                                          61MiB |
|    0      4363    C   python3                                       7549MiB |
+-----------------------------------------------------------------------------+


  • Main observation: the paperspace server (Ubuntu 14.04) is slower than the AMD server reported above:
AMD Server:  0:49:52 74.81598 sec/iteration mse = 4.30359
PaperSpace:  0:50:20 79.49551 sec/iteration mse = 2.31114


  • Execution time:
1:42:38 76.98630 sec/iteration mse = 3.69874
1:48:57 76.90594 sec/iteration mse = 3.52184
1:55:12 76.80817 sec/iteration mse = 3.68672
2:01:27 76.71252 sec/iteration mse = 4.17474
Model saved to /tmp/RNN_LRU_R_V8_model__20_20_60_5.00E-01_1.00E-03_118_1.00E-05_118_100_1_100_118_100_8
Execution Time: 2:06:29
2:06:29 76.66159 sec/iteration mse = 2.01788