Tutorial: Docker Anaconda Python -- 4

From dftwiki

D. Thiebaut, 17:46, 16 July 2018 (EDT)

This is Part 4 of a tutorial on using Docker. In this part we build containers that hold the complete environment needed for data-science work in Python: Anaconda and Jupyter notebooks. The top level of this tutorial can be found here.



From Wikipedia:

Anaconda is a free and open source distribution of the Python and R programming languages for data science and machine learning related applications (large-scale data processing, predictive analytics, scientific computing), that aims to simplify package management and deployment. Package versions are managed by the package management system conda. The Anaconda distribution is used by over 6 million users, and it includes more than 250 popular data science packages suitable for Windows, Linux, and MacOS.

Anaconda provides access to over 1,000 data-science packages and installs hundreds of them by default. This page lists all the packages contained in the Mac OS X Anaconda distribution, 598 in total.
The main advantage of Anaconda is that all the packages it contains are version-compatible with each other and, if installed in a container, will not interfere with any of the packages installed on your host computer (laptop). However, such a container will be huge: the one we create in the exercise below takes up 9.7 GB of disk space.

Exercise 5: Creating a Container with Anaconda


towardsDataScience: Not Reinventing the Wheel

We could write a simple Dockerfile based on the latest Ubuntu image, install Anaconda, and then refine the installation. Instead, we'll use one of the public Anaconda Docker images that have already been created, and either download it or recreate it on our system.
We found the towards-data-science article (https://towardsdatascience.com/docker-for-data-science-9c0ce73e8263) to provide a solid container, and we will use their product here.

Option 1: Pull the Public towardsdatascience Image

The fastest option (relatively) is to pull their public image from the Docker Hub. You need to log in to the Docker Hub from the command line first.

docker login
docker pull evheniy/docker-data-science
Using default tag: latest
latest: Pulling from evheniy/docker-data-science
cc1a78bfd46b: Pull complete 
314b82d3c9fe: Pull complete 
adebea299011: Pull complete 
f7baff790e81: Pull complete 
Digest: sha256:e07b9ca98ac1eeb1179dbf0e0bbcebd87701f8654878d6d8ce164d71746964d1
Status: Downloaded newer image for evheniy/docker-data-science:latest

This took almost 10 minutes on a 2016 MacBook Pro over a Wi-Fi connection.

Option 2: Dockerfile

The other option (which probably takes about the same amount of time) is to recreate their Dockerfile in a directory of your choice, customize it if needed (by adding a new user, for example), and build it. Here is our customized version.

# We will use Ubuntu for our image
FROM ubuntu:latest

# Updating Ubuntu packages
RUN apt-get update && apt-get upgrade -y
RUN apt-get install -y emacs

# Adding wget and bzip2
RUN apt-get install -y wget bzip2

# Add sudo
RUN apt-get -y install sudo

# Add user ubuntu with no password, add to sudo group
RUN adduser --disabled-password --gecos '' ubuntu
RUN adduser ubuntu sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER ubuntu
WORKDIR /home/ubuntu/
RUN chmod a+rwx /home/ubuntu/
#RUN echo `pwd`

# Installing Anaconda
RUN wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh
RUN bash Anaconda3-5.0.1-Linux-x86_64.sh -b
RUN rm Anaconda3-5.0.1-Linux-x86_64.sh

# Set path to conda
#ENV PATH /root/anaconda3/bin:$PATH
ENV PATH /home/ubuntu/anaconda3/bin:$PATH

# Updating Anaconda packages (-y answers the confirmation prompt, since the build is non-interactive)
RUN conda update -y conda
RUN conda update -y anaconda
RUN conda update -y --all

# Configuring access to Jupyter
RUN mkdir /home/ubuntu/notebooks
RUN jupyter notebook --generate-config --allow-root
RUN echo "c.NotebookApp.password = u'sha1:6a3f528eec40:6e896b6e4828f525a6e20e5411cd1c8075d68619'" >> /home/ubuntu/.jupyter/jupyter_notebook_config.py

# Jupyter listens on port 8888
EXPOSE 8888

# Run Jupyter notebook as Docker main process, listening on all interfaces
CMD ["jupyter", "notebook", "--allow-root", "--notebook-dir=/home/ubuntu/notebooks", "--ip=0.0.0.0", "--port=8888", "--no-browser"]
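
The c.NotebookApp.password line in the Dockerfile stores a salted hash, not the password itself. As far as we understand the format used by the classic Jupyter Notebook server, the value is laid out as algorithm:salt:digest, with the digest computed over the passphrase followed by the salt. The sketch below reproduces that layout with the Python standard library so that you can generate a hash for a password of your own. The function names make_hash and check_hash are ours and are not part of Jupyter; inside the container, the notebook.auth.passwd() helper of the classic Notebook package is the supported way to produce such a string.

```python
import hashlib
import secrets


def make_hash(passphrase, algorithm="sha1", salt_len=12):
    """Return 'algorithm:salt:digest' for the given passphrase (assumed layout)."""
    salt = secrets.token_hex(salt_len // 2)  # salt_len hexadecimal characters
    digest = hashlib.new(algorithm, (passphrase + salt).encode("utf-8")).hexdigest()
    return f"{algorithm}:{salt}:{digest}"


def check_hash(hashed, passphrase):
    """Return True if passphrase matches a string produced by make_hash."""
    try:
        algorithm, salt, digest = hashed.split(":", 2)
        h = hashlib.new(algorithm, (passphrase + salt).encode("utf-8"))
    except ValueError:
        # Wrong number of fields, or an algorithm hashlib does not know.
        return False
    return secrets.compare_digest(h.hexdigest(), digest)


if __name__ == "__main__":
    hashed = make_hash("my-secret")
    print(hashed)                           # a new salt, hence a new string, on every run
    print(check_hash(hashed, "my-secret"))  # True
    print(check_hash(hashed, "wrong"))      # False
```

To use a different password, paste the printed string in place of the sha1:... value in the RUN echo line and rebuild the image.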

Notebooks directory

The container will run with our local $PWD/notebooks directory mounted on its internal /home/ubuntu/notebooks directory, so we create the local directory first:

mkdir notebooks

We then build the image from the directory that contains the Dockerfile:


docker build -t toward-data-science .

This will take quite a long time (around 15 minutes) and generate a huge log.

Running a Jupyter Notebook

We first start a container from the newly created image:

docker run --name toward-data-science -p 8888:8888 --env="DISPLAY" \
      -v "$PWD/notebooks:/home/ubuntu/notebooks" -d toward-data-science

Then we open the following URL in our browser: http://localhost:8888
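
Because the container is started in detached mode (-d), the notebook server may need a few seconds before it answers. A minimal sketch of a readiness check, run on the host with only the Python standard library, could look like this. The helper name wait_for_server and the timeout values are our own choices.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url="http://localhost:8888", timeout=30.0, interval=1.0):
    """Poll url until it returns any HTTP response, or until timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered, even if with an error status.
            return True
        except (urllib.error.URLError, OSError):
            pass  # nothing is listening yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

Calling wait_for_server() returns True as soon as the server responds. If it returns False, the output of docker logs toward-data-science usually shows why the server did not start.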


Enter "root" as the password. Then create a simple notebook to test various libraries.
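
A first notebook cell might simply confirm that the libraries you care about can be imported, and report their versions. The sketch below uses only the standard library, so it runs anywhere. The package list is an illustrative guess at what a data-science session needs, not something taken from the image's documentation.

```python
import importlib


def library_versions(names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


# Hypothetical selection of packages; edit the list to match your own needs.
for name, version in library_versions(["numpy", "pandas", "matplotlib", "sklearn"]).items():
    print(f"{name:12s} {version or 'NOT INSTALLED'}")
```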



This concludes this exercise.

Click here to go to Part 5 of this tutorial.